Deploying Model Services

Several frameworks are available for launching model services, including vLLM, Ollama, LLaMA, and TGI. vLLM is recommended for deploying large language model services.

vLLM Image Installation

vLLM (Very Large Language Models) is an efficient framework for large language model inference and deployment, developed at UC Berkeley. By optimizing memory management and computational resource utilization, vLLM enables efficient deployment of large language models. It can be installed in local or cloud environments and supports acceleration on a range of hardware platforms, including GPUs and CPUs.

It is recommended to use the latest version of the vllm-openai image. The version used in this documentation is v0.9.0.1.

Both online and offline installation are supported; online installation is recommended.

Online Installation (Recommended)

Pull the vLLM image directly from the docker.io repository:

docker pull vllm/vllm-openai:v0.9.0.1
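
To confirm the image was pulled successfully, list the local vLLM images (a standard Docker check):

docker images vllm/vllm-openai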

If docker.io is inaccessible, use the Alibaba Cloud repository image:

1. Log in to the Alibaba Cloud image registry (password: supermap@123):

docker login --username=478386058@qq.com registry.cn-chengdu.aliyuncs.com

2. Pull the vllm-openai image:

docker pull registry.cn-chengdu.aliyuncs.com/supermap-ai/vllm-openai:v0.9.0.1

3. Retag the image to the name used in this documentation:

docker tag registry.cn-chengdu.aliyuncs.com/supermap-ai/vllm-openai:v0.9.0.1 vllm/vllm-openai:v0.9.0.1

Offline Installation

If online installation is not feasible, download the offline image package from the network disk and load it with:

docker load -i vllm-openai-v0.9.0.1.tar

Deploying Word Embedding Model Service

iPortal AI Assistant supports retrieval over specialized knowledge. This requires storing the knowledge in Knowledge Bases and vectorizing the documents, so a word embedding model service must be deployed to perform the vectorization.

Both online and offline deployment are supported; online deployment is recommended.

Online Deployment (Recommended)

Start a vLLM model service to deploy the bge-m3 word embedding model:

docker run -d --gpus '"device=0"' -v /opt/models/modelscope:/root/.cache/modelscope -p 8001:8000 --ipc=host --name vllm-bge-m3 vllm/vllm-openai:v0.9.0.1 --model BAAI/bge-m3 --served-model-name bge-m3 --task embedding
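
After the container starts, you can verify the embedding service through vLLM's OpenAI-compatible API. The request below is a minimal check, assuming the service is reachable on the host at port 8001 (as mapped above) and using an arbitrary test sentence:

curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": ["test sentence for vectorization"]}'

A JSON response containing an embedding vector indicates the service is working.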

Offline Deployment

If online deployment is not possible, download the model locally using Git:

1. Install Git and Git LFS:

apt-get install git

apt-get install git-lfs

git lfs install

2. Create a directory and download the bge-m3 model:

mkdir -p /opt/models/modelscope/hub/BAAI

cd /opt/models/modelscope/hub/BAAI

git clone https://www.modelscope.cn/BAAI/bge-m3.git

3. Start the vLLM model service using the local bge-m3 model:

docker run -d --gpus '"device=0"' -v /opt/models/modelscope/hub/BAAI/bge-m3:/root/.cache/modelscope/hub/BAAI/bge-m3 -p 8001:8000 --ipc=host --name vllm-bge-m3 vllm/vllm-openai:v0.9.0.1 --model /opt/models/modelscope/hub/BAAI/bge-m3 --served-model-name bge-m3 --task embedding
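
Before relying on the offline deployment, it helps to confirm that Git LFS fetched the actual weight files (not just pointer files) and that the container started correctly; the following are generic checks:

ls -lh /opt/models/modelscope/hub/BAAI/bge-m3

docker logs -f vllm-bge-m3

The log should show the OpenAI-compatible server listening on port 8000 inside the container; the exact message varies with the vLLM version.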

Deploying Open-Source LLM Service

iPortal AI Assistant works with large language models that support Function Call. For optimal results, deploy the Qwen3-14B model service.

Both online and offline deployment are supported; online deployment is recommended.

Online Deployment (Recommended)

Start a vLLM model service to deploy the Qwen model:

docker run -d --gpus '"device=0,1"' -v /opt/models/modelscope:/root/.cache/modelscope --env "VLLM_USE_MODELSCOPE=true" -p 8000:8000 --ipc=host --name vllm-qwen3-14b vllm/vllm-openai:v0.9.0.1 --model Qwen/Qwen3-14B --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 2
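
Once the container is up, a quick way to verify both chat and function calling is to send an OpenAI-compatible chat completion request that includes a sample tool definition. The get_weather tool below is purely illustrative:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B",
    "messages": [{"role": "user", "content": "What is the weather in Beijing today?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Query the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

If function calling is enabled correctly, the response should contain a tool_calls entry invoking get_weather rather than a plain text answer.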

Offline Deployment

If online deployment is not possible, download the model locally using Git:

1. Install Git and Git LFS:

apt-get install git

apt-get install git-lfs

git lfs install

2. Create a directory and download the Qwen3-14B model:

mkdir -p /opt/models/modelscope/hub/Qwen

cd /opt/models/modelscope/hub/Qwen

git clone https://www.modelscope.cn/Qwen/Qwen3-14B.git

3. Start the vLLM model service using the local Qwen model:

docker run -d --gpus '"device=0,1"' -v /opt/models/modelscope/:/root/.cache/modelscope --env "VLLM_USE_MODELSCOPE=true" -p 8000:8000 --ipc=host --name vllm-qwen3-14b vllm/vllm-openai:v0.9.0.1 --model /root/.cache/modelscope/hub/Qwen/Qwen3-14B --served-model-name Qwen/Qwen3-14B --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 2
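
To confirm that the service registered the model under the expected name (Qwen/Qwen3-14B, as set by --served-model-name), query the models endpoint of the OpenAI-compatible API:

curl http://localhost:8000/v1/models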

Parameter Description
-d: Run the service in the background.
--gpus '"device=0,1"': Select GPUs according to the model size. '"device=0,1"' uses the GPUs with IDs 0 and 1; all uses all available GPUs.
-v /opt/models/...: Mount the host model directory into the container.
--env "VLLM_USE_MODELSCOPE=true": Load models from ModelScope instead of the Hugging Face Hub.
--name: Specify the container name.
-p 8000:8000: Map the host port to the container port.
--ipc=host: Share the host's IPC namespace so the container can use host shared memory.
--model: Model path (inside the container) or model ID.
--served-model-name: Model name exposed by the API.
--tensor-parallel-size 2: Run tensor parallelism across 2 GPUs (required for 32B-GPTQ-Int4 models).
--gpu-memory-utilization 0.95: Fraction of GPU memory to allocate (0-1, default 0.9); higher values are needed for 32B-GPTQ-Int4 models.
--enable-auto-tool-choice --tool-call-parser hermes: Enable function calling for the model.
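
After starting a service, GPU selection and memory usage can be checked on the host with nvidia-smi (assuming the NVIDIA driver and tools are installed); the reported memory usage should roughly match the --gpu-memory-utilization setting on the selected GPUs:

nvidia-smi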